Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances using only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, because classification and localization have different optimization objectives, the temporally localized results suffer from a serious incompleteness issue. To tackle this issue without additional annotations, this paper distills free action knowledge from Vision-Language Pre-training (VLP), motivated by the surprising observation that the localization results of vanilla VLP are over-complete, which is exactly complementary to the CBP results. To fuse this complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP, respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill confident background pseudo-labels from the CBP branch, while during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that our method significantly outperforms state-of-the-art methods.
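The pseudo-label selection in the B and F steps can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the per-frame foreground scores, and the thresholds `low` and `high` are all hypothetical, and the abstract does not say how confidence is actually measured.

```python
def background_pseudo_labels(cbp_scores, low=0.2):
    """B step (assumed rule): frames the CBP branch scores at or below `low`
    become confident background labels (0); all other frames stay unlabeled."""
    return [0 if s <= low else None for s in cbp_scores]


def foreground_pseudo_labels(vlp_scores, high=0.8):
    """F step (assumed rule): frames the VLP branch scores at or above `high`
    become confident foreground labels (1); all other frames stay unlabeled."""
    return [1 if s >= high else None for s in vlp_scores]


# Hypothetical per-frame foreground scores for a five-frame video.
cbp_scores = [0.05, 0.10, 0.90, 0.40, 0.15]  # CBP fires on only part of the action
vlp_scores = [0.30, 0.40, 0.95, 0.90, 0.50]  # VLP covers a wider span

b_targets = background_pseudo_labels(cbp_scores)
f_targets = foreground_pseudo_labels(vlp_scores)
print(b_targets)
print(f_targets)
```

In an alternating schedule, each set of targets would presumably supervise the other branch, with unlabeled frames (`None`) left out of the loss.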
The statistical heterogeneity of non-independent and identically distributed (non-IID) data across local clients significantly limits the performance of federated learning. Previous attempts such as FedProx, SCAFFOLD, MOON, FedNova and FedDyn take an optimization perspective, adding an auxiliary term or re-weighting local updates to calibrate the learning bias or the objective inconsistency. However, beyond these improvements to federated averaging, our analysis shows that another critical bottleneck is the poorer optima of client models under more heterogeneous conditions. We thus introduce a data-driven approach called FedSkip, which improves the client optima by periodically skipping federated averaging and scattering local models across devices. We provide a theoretical analysis of the possible benefit of FedSkip and conduct extensive experiments on a range of datasets, demonstrating that FedSkip achieves much higher accuracy, better aggregation efficiency and competitive communication efficiency. Source code is available at: https://github.com/MediaBrain-SJTU/FedSkip.
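The schedule described above, in which federated averaging is periodically skipped and local models are scattered across devices, can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `local_update`, `scatter`, `run`, the `period` parameter, the random-permutation scattering rule, and the list-of-floats models are all assumptions made for the example.

```python
import random


def local_update(model, target, lr=0.5):
    """Toy stand-in for local training: move each weight toward the client's target."""
    return [w + lr * (t - w) for w, t in zip(model, target)]


def average(models):
    """Plain federated averaging over equally weighted clients."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def scatter(models, rng):
    """Skip aggregation: reassign the local models to clients by a random permutation."""
    shuffled = models[:]
    rng.shuffle(shuffled)
    return shuffled


def run(targets, rounds=6, period=3, seed=0):
    """Average every `period`-th round (assumed schedule); scatter in all other rounds."""
    rng = random.Random(seed)
    dim = len(targets[0])
    models = [[0.0] * dim for _ in targets]
    schedule = []
    for r in range(1, rounds + 1):
        models = [local_update(m, t) for m, t in zip(models, targets)]
        if r % period == 0:
            global_model = average(models)
            models = [global_model[:] for _ in targets]
            schedule.append("average")
        else:
            models = scatter(models, rng)
            schedule.append("scatter")
    return models, schedule


models, schedule = run([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(schedule)
```

In this toy setup, scattering lets each model visit the data of several clients between two aggregation rounds, which is one plausible reading of how the client optima are improved.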
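This paper considers few-shot anomaly detection (FSAD), a practical yet under-studied setting for anomaly detection (AD) in which only a limited number of normal images are provided for each category at training time. So far, existing FSAD studies follow the one-model-per-category learning paradigm used for standard AD, and the commonality across categories has not been explored. Inspired by how humans detect anomalies, i.e., by comparing the image in question with normal images, we leverage registration, an image alignment task that inherently generalizes across categories, as the proxy task to train a category-agnostic anomaly detection model. During testing, anomalies are identified by comparing the registered features of the test image with those of its corresponding support (normal) images. To the best of our knowledge, this is the first FSAD method that trains a single generalizable model and requires no re-training or parameter fine-tuning for new categories. Experimental results show that the proposed method outperforms state-of-the-art FSAD methods in AUC on the MVTec and MPDD benchmarks.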
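The test-time comparison between the features of a test image and those of its support (normal) images can be illustrated with a minimal sketch. It swaps in a simple nearest-neighbor distance as the comparison rule, which is an assumption made here and not necessarily the scoring used in the paper; the function name and the toy two-dimensional features are likewise hypothetical.

```python
import math


def anomaly_scores(test_feats, support_feats):
    """For each test feature vector, return the Euclidean distance to its
    nearest support (normal) feature vector; larger means more anomalous."""
    return [min(math.dist(f, s) for s in support_feats) for f in test_feats]


# Toy features: three normal support vectors and two test vectors.
support = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
test = [[0.1, 0.0], [3.0, 4.0]]

scores = anomaly_scores(test, support)
print(scores)
```

The second test vector lies far from every support vector and therefore receives the larger score.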
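In multi-modal multi-agent trajectory forecasting, two major challenges have not been fully tackled: 1) how to measure the uncertainty brought by the interaction module, which causes correlations among the predicted trajectories of multiple agents; 2) how to rank the multiple predictions and select the optimal predicted trajectory. To handle these challenges, this work first proposes a novel concept, collaborative uncertainty (CU), which models the uncertainty resulting from interaction modules. We then build a general CU-aware regression framework with an original permutation-equivariant uncertainty estimator to perform both regression and uncertainty estimation. Furthermore, we apply the proposed framework to current SOTA multi-agent multi-modal forecasting systems as a plugin module, which enables the SOTA systems to 1) estimate the uncertainty in the multi-agent multi-modal trajectory forecasting task; 2) rank the multiple predictions and select the optimal one based on the estimated uncertainty. We conduct extensive experiments on a synthetic dataset and two public large-scale multi-agent trajectory forecasting benchmarks. The experiments show that: 1) on the synthetic dataset, the CU-aware regression framework allows the model to appropriately approximate the ground-truth Laplace distribution; 2) on the multi-agent trajectory forecasting benchmarks, the CU-aware regression framework steadily helps SOTA systems improve their performance. In particular, the proposed framework helps VectorNet improve by 262 cm in the final displacement error of the chosen optimal prediction on the nuScenes dataset; 3) for multi-agent multi-modal trajectory forecasting systems, prediction uncertainty is positively correlated with future stochasticity; 4) the estimated CU values are highly related to the interactive information among agents.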
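Selecting the optimal prediction by estimated uncertainty, and measuring its final displacement error, can be sketched as below. This is a minimal illustration under assumptions: the uncertainty values are taken as given scalars (one per candidate trajectory), the lowest value is assumed to mark the best candidate, and the function names and toy trajectories are hypothetical.

```python
import math


def select_by_uncertainty(uncertainties):
    """Return the index of the candidate with the lowest estimated uncertainty."""
    return min(range(len(uncertainties)), key=lambda i: uncertainties[i])


def final_displacement_error(pred, truth):
    """Euclidean distance between the last predicted and the last true position."""
    return math.dist(pred[-1], truth[-1])


# Two candidate trajectories for one agent, each a list of (x, y) positions.
predictions = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (2.0, 0.0)],
]
uncertainties = [0.8, 0.3]  # hypothetical estimator outputs
truth = [(0.0, 0.0), (2.0, 1.0)]

best = select_by_uncertainty(uncertainties)
fde = final_displacement_error(predictions[best], truth)
print(best, fde)
```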
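In this paper, we propose an advanced approach to monocular 3D lane detection that leverages the geometric structure underlying the 2D-to-3D lane reconstruction process. Inspired by previous methods, we first analyze the geometric heuristic between a 3D lane and its 2D representation, and propose explicit supervision based on this structure prior, which makes it possible to build inter-lane and intra-lane relationships that facilitate the reconstruction of 3D lanes from local to global. Second, to reduce the structure loss in the 2D lane representation, we extract top-view lane information directly from front-view images, which greatly eases the confusion of distant lane features seen in previous methods. Furthermore, we propose a novel task-specific data augmentation method that synthesizes new training data for both the segmentation and reconstruction tasks in our pipeline, countering the imbalanced data distribution of camera pose and ground slope and improving generalization to unseen data. Our work marks the first attempt to bring geometric prior information into DNN-based 3D lane detection, and it makes it possible to detect lanes at extra-long distances, doubling the original detection range. The proposed method can be smoothly adopted by other frameworks at no extra cost. Experimental results show that our work outperforms state-of-the-art approaches by 3.8% on the Apollo 3D synthetic dataset, at a real-time speed of 82 FPS and without introducing extra parameters.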
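This paper considers the problem of fast MRI reconstruction. We propose a novel Transformer-based framework that directly processes sparsely sampled signals in k-space, going beyond the regular-grid limitation of ConvNets. We adopt an implicit representation of the spectrogram, treating spatial coordinates as inputs and dynamically querying the partially observed measurements to complete the spectrogram, i.e., learning the inductive bias in k-space. To strike a balance between computational cost and reconstruction quality, we build a hierarchical structure with a low-resolution decoder and a high-resolution decoder. To validate the necessity of the proposed modules, we conduct extensive experiments on two public datasets and demonstrate superior or comparable performance relative to state-of-the-art approaches.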
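Self-supervised learning has achieved great success in the representation learning of visual and textual data. However, current methods are mainly validated on well-curated datasets, which do not exhibit the long-tailed distribution of the real world. Recent attempts at self-supervised long-tailed learning rebalance from the loss perspective or the model perspective, resembling the paradigms of supervised long-tailed learning. Nevertheless, without the aid of labels, these explorations have not shown the expected promise, owing to limitations in tail sample discovery or heuristic structure design. Different from previous works, we explore this direction from an alternative perspective, i.e., the data perspective, and propose a novel Boosted Contrastive Learning (BCL) method. Specifically, BCL leverages the memorization effect of deep neural networks to automatically drive the information discrepancy between sample views in contrastive learning, which more effectively enhances long-tailed learning in the label-unaware setting. Extensive experiments on a range of benchmark datasets demonstrate the effectiveness of BCL over several state-of-the-art methods. Our code is available at https://github.com/mediabrain-sjtu/bcl.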
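In this paper, we target self-supervised representation learning for zero-shot tumor segmentation. We make the following contributions. First, we advocate a zero-shot setting, in which pre-trained models should be directly applicable to the downstream task without using any manual annotations. Second, we take inspiration from "layer decomposition" and innovate on the training regime with simulated tumor data. Third, we conduct extensive ablation studies to analyze the critical components of the data simulation and to validate the necessity of the different proxy tasks. We demonstrate that, with sufficient texture randomization in simulation, the trained model can effortlessly generalize to segmenting real tumor data. Fourth, our approach achieves superior results for zero-shot tumor segmentation on different downstream datasets, covering brain tumor segmentation and LiTS2017 for liver tumor segmentation. When evaluating model transferability for tumor segmentation under a low-annotation regime, the proposed approach also outperforms all existing self-supervised approaches, opening up the use of self-supervised learning in practical scenarios.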
Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs, which capture only local physical dependencies among joints and may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called the A-link inference module, to capture action-specific latent dependencies, i.e., actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e., structural links. Combining the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolution network (AS-GCN), which stacks actional-structural graph convolution and temporal convolution as a basic building block to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN in action recognition using two skeleton data sets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvements compared to the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction. Our code is available at https://github.com/limaosen0/AS-GCN.
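One common way to picture higher-order dependencies on a skeleton graph is through powers of its adjacency matrix, where the k-th power marks joints connected by a walk of k bones. The sketch below uses that idea on a toy four-joint chain. It is an assumption made for illustration and not necessarily how AS-GCN defines its structural links; the helper names and the example graph are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def k_hop_links(adj, order):
    """Boolean matrix whose (i, j) entry is True when a walk of exactly
    `order` edges joins joints i and j."""
    power = adj
    for _ in range(order - 1):
        power = matmul(power, adj)
    return [[value > 0 for value in row] for row in power]


# Toy skeleton: a chain of four joints, 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

second_order = k_hop_links(adjacency, 2)
print(second_order[0])
```

On this chain, joints 0 and 2 are linked at order 2 even though no bone joins them directly, which is the kind of dependency a fixed first-order graph would miss.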
The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to deploy transformers in reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL) in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, the methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework; they model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, the methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework; they can extract policies from static datasets and fully exploit the long-sequence modeling capability of the transformer. Given these advancements, we review extensions and challenges in TRL and discuss proposals for future directions. We hope that this survey provides a detailed introduction to TRL and motivates future research in this rapidly developing field.